
Disentangle metric reporting from the actual reconciler. #2762

Merged · 1 commit into tektoncd:master from periodic-metrics · Jun 8, 2020

Conversation

@mattmoor (Member) commented Jun 5, 2020

This changes the metric reporting to happen periodically vs. being
triggered by reconciliation. The previous method was prone to stale
data because it was driven by the informer cache immediately following
writes through the client, and might not fix itself since the status
may not change (even on resync every 10h). With this change it should
exhibit the correct value within 30s + {informer delay}, where the 30s
is configurable on the Recorders.

Fixes: #2729
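
For illustration, here is a minimal sketch of the periodic-reporting pattern described above. This is not the actual Tekton code: the reporter type and the runLister and gauge interfaces are hypothetical stand-ins; only the shape of the ticker-driven loop reflects the change.

package metrics

import (
	"context"
	"time"
)

// runLister abstracts "count the not-yet-done runs in the lister cache".
type runLister interface {
	CountRunning() (int, error)
}

// gauge abstracts the metric being exported.
type gauge interface {
	Set(n int)
}

type reporter struct {
	// ReportingPeriod controls how often the cache is re-scanned
	// (the PR description mentions a 30s default).
	ReportingPeriod time.Duration
	running         gauge
}

// ReportRunning recomputes the metric from the whole cache on every tick,
// so a stale value is corrected within one period instead of waiting for
// another status change to re-trigger reporting.
func (r *reporter) ReportRunning(ctx context.Context, lister runLister) {
	ticker := time.NewTicker(r.ReportingPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // controller is shutting down
		case <-ticker.C:
			if n, err := lister.CountRunning(); err == nil {
				r.running.Set(n)
			}
		}
	}
}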

@tekton-robot tekton-robot requested review from dibyom and a user June 5, 2020 14:13
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 5, 2020
@tekton-robot (Collaborator)

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flaky test.
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneous one.

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                                        Old Coverage   New Coverage   Delta
pkg/reconciler/pipelinerun/metrics.go       86.8%          84.7%          -2.0
pkg/reconciler/pipelinerun/pipelinerun.go   80.2%          79.9%          -0.4
pkg/reconciler/taskrun/controller.go        96.2%          96.3%          0.1
pkg/reconciler/taskrun/metrics.go           86.7%          85.4%          -1.2
pkg/reconciler/taskrun/taskrun.go           76.0%          75.4%          -0.6

@mattmoor (Member, Author) commented Jun 5, 2020

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 5, 2020
@vdemeester (Member) left a comment

SGTM 👼
/cc @hrishin @afrittoli

}

// NewRecorder creates a new metrics recorder instance
// to log the PipelineRun related metrics
func NewRecorder() (*Recorder, error) {
	r := &Recorder{
		initialized: true,

		// Default to 30s intervals.
		ReportingPeriod: 30 * time.Second,

Member (review comment):

It could be nice if we keep this configurable? 🤔

Member Author (reply):

It's a public field, so it can be changed after the Recorder is created, but I see that this is just consumed by NewController without really having a place to hook into. I'm happy to do whatever is conventional for Tekton here.
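
For illustration only (not part of this PR): since ReportingPeriod is a public field, controller setup code could override the default after calling the NewRecorder shown above. newTunedRecorder is a hypothetical helper and assumes the package's existing time import.

// Hypothetical sketch: override the 30s default after construction.
func newTunedRecorder() (*Recorder, error) {
	r, err := NewRecorder()
	if err != nil {
		return nil, err
	}
	r.ReportingPeriod = 10 * time.Second // example value, not a recommendation
	return r, nil
}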

Member (reply):

I'm not sure either (I was away for some time). It's better if @afrittoli @vdemeester comment more on it.
In my opinion, we can keep it as it is; if required in the future we can provide the relevant configuration option :)

However, it might be nice to document it, so users are aware of the 30-second window to get updated metrics.

@hrishin (Member) left a comment

Overall /LGTM. Thank you very much @mattmoor for the findings 🙇, I didn't realize this could happen. I was under the impression that client-go reconciles the state quite effectively by updating the lister's cache before firing the reconcile event, but I missed the resync part.
However, I wonder about two things 🤔

  1. Could processing O(n) resources every time have some unwanted side effect for the controller deployment? (If you're right, then deep down the metrics implementation may need some redesigning.)
  2. If resync could be a problem, isn't it a general reconciliation problem? (Forgive me if I'm missing some point.)

@mattmoor (Member, Author) commented Jun 5, 2020

Could processing O(n) resources every time have some unwanted side effect for the controller deployment?

O(n) isn't a change from before, but the way this is written now ensures that only one scan is happening at a time. Suppose "n" is huge and it takes 10 minutes to scan the lister cache.

This approach would just take 10:30 to expose the metric. The previous approach would leak goroutines every time we update a single resource's status. So in addition to being O(n) in runtime, it also might leak O(n) goroutines[1] if all n pipelines are actively running.

[1] - I'm assuming there are roughly a constant number of status updates, but this could be much higher given that there is no deduplication by key.

If resync could be a problem, isn't it a general reconciliation problem?

It is. For global resyncs we jitter to avoid starving real work for a period of time. The default resync period is also every 10 hours to mitigate this, but that seems overly coarse for metrics resolution to be useful.
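
For context, a hedged sketch of what jittering a resync period generically looks like; jitteredPeriod is a hypothetical helper, not client-go or knative/pkg code.

package resync

import (
	"math/rand"
	"time"
)

// jitteredPeriod spreads resyncs over [base, base*(1+maxFactor)) so a global
// resync doesn't fire for every key at once and starve real reconcile work.
func jitteredPeriod(base time.Duration, maxFactor float64) time.Duration {
	return base + time.Duration(rand.Float64()*maxFactor*float64(base))
}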

@mattmoor (Member, Author) commented Jun 5, 2020

/retest

@hrishin (Member) commented Jun 5, 2020

Thank you @mattmoor for the explanation.

O(n) isn't a change from before, but the way this is written now ensures that only one scan is happening at a time. Suppose "n" is huge and it takes 10 minutes to scan the lister cache.

O(n) holds true; sorry, to rephrase it, I meant every time vs. only when something changes. :)

The previous approach would leak goroutines every time we update a single resource's status.

True, as there wasn't any serialization in place (frankly, I was thinking of serializing it during the implementation using a queue, but later felt that would be over-engineering), so it could create stale metrics. However, the current PR's approach handles that case better.

[2]

The previous method was prone to stale
data because it was driven by the informer cache immediately following
writes through the client, and might not fix itself since the status
may not change (even on resync every 10h). With this change it should
exhibit the correct value within 30s + {informer delay}, where the 30s
is configurable on the Recorders.

On a different note, I'm still not able to follow comment [2] and am a bit curious; would you like to elaborate on it?
As the comment mentions:

Now, informers are fast, but in all likelihood the lister that is being used to drive this metric hasn't received the updated version of the resource from the API Server, so unless the informer cache's copy is being modified (a no no) then the cache won't reflect what it's clearly (given the if updated check) trying to capture.

However, I'm under the impression that, either through the list or the watch, the updated copy of the resource is placed into the cache first, and then the informer is notified about the updated resource, which triggers reconciliation. Thus the lister in the reconciler refers to the same copy from the cache, which is up to date with the update. Is that right?

@mattmoor (Member, Author) commented Jun 6, 2020

I'm under the impression that, either through the list or the watch, the updated copy of the resource is placed into the cache first, and then the informer is notified about the updated resource, which triggers reconciliation

Yes, the informer model uses lists and watches, and simulates events by observing changes to those lists through the watch. The premise of using the informer cache for reads during reconciliation is that the data is never a stale read of the change that motivated the reconciliation (because the event was generated by an update to the informer cache). It could still be stale if subsequent updates came in, but this is why "in-flight" work isn't deduplicated with "queued" work (e.g. even if a key is being processed, it can be duplicated in the work queue).

However, in the case at HEAD, the metrics logic is keying off of "if updated", which is NOT an informer event (i.e. observer-side, where the lister caches have been updated). So the sequence of events where this works is:

  1. update status (updated = true)
  2. informer QUICKLY observes the change and updates the lister cache (the key is also requeued)
  3. metrics goroutine is spawned and sees correct data
  4. The key is processed again, nothing to do (updated = false)
  5. metrics goroutine is not spawned because nothing changed.

2 -> 3 is the informer event happening first by chance; there is no causal link here. We aren't updating metrics because the informer cache reflects the updated state.

The sequence of events where this doesn't work is:

  1. update status (updated = true)
  2. metrics goroutine is spawned and sees INCORRECT data
  3. informer SLOWLY observes the change and updates the lister cache (the key is also requeued)
  4. The key is processed again, nothing to do (updated = false)
  5. metrics goroutine is not spawned because nothing changed.

In the former case we get really lucky (informers are fast, but not "faster than a local goroutine" fast). There used to be a bug that meant we didn't have to get lucky, but that is fixed at HEAD. Two other notable aspects to this:

  1. The metrics data is never reevaluated on a no-op resync, so bad data stays bad until there's a change (and we get lucky enough that there isn't new bad data).
  2. When the controller process starts up, it won't surface this metric data until it observes a change.

Fundamentally, what's at HEAD is "edge triggered" (keying off of deltas), vs. "level based" (changes trigger a more holistic reconsideration). It's not that "level based" is the only correct way to do things, and it can seem wasteful to reconsider everything when we think we know exactly what changed, but it can be much more forgiving of errors and dropped events. I had some slides on this that it seems I never shared publicly and are locked up at Google somewhere 😞
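
As a rough illustration of that distinction, a hedged sketch with hypothetical helpers (not Tekton's actual code): the first branch mimics the edge-triggered pattern at HEAD, while the ticker loop mimics the level-based pattern this PR moves to.

package main

import (
	"fmt"
	"time"
)

// reportFromCache recomputes the metric from the whole (simulated) lister
// cache; this is what a level-based reporter does on every tick.
func reportFromCache(cache map[string]bool) {
	running := 0
	for _, isRunning := range cache {
		if isRunning {
			running++
		}
	}
	fmt.Println("running:", running)
}

func main() {
	cache := map[string]bool{"run-a": true, "run-b": false}

	// Edge triggered: report only when a reconcile believes it changed
	// something. If the cache read races the write, the wrong value sticks
	// until the next status change happens to fix it.
	updated := true
	if updated {
		go reportFromCache(cache)
	}

	// Level based: recompute from the whole cache on every tick, so any
	// stale value is overwritten within one reporting period.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		reportFromCache(cache)
	}
}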

@afrittoli (Member) left a comment

Thanks for this! It looks like a move in the right direction.
The more logic we manage to take out of the core reconciler the better, as it's becoming harder to maintain otherwise.
/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 6, 2020
@vdemeester (Member) left a comment

/lgtm
/meow

@tekton-robot (Collaborator)

@vdemeester: [cat image]

In response to this:

/lgtm
/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2020
@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, hrishin, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [afrittoli,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@afrittoli (Member)

@mattmoor I'm afraid this will need a rebase :)

This changes the metric reporting to happen periodically vs. being
triggered by reconciliation.  The previous method was prone to stale
data because it was driven by the informer cache immediately following
writes through the client, and might not fix itself since the status
may not change (even on resync every 10h).  With this change it should
exhibit the correct value within 30s + {informer delay}, where the 30s
is configurable on the Recorders.

Fixes: tektoncd#2729
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2020
@mattmoor (Member, Author) commented Jun 8, 2020

Rebased.

@mattmoor (Member, Author) commented Jun 8, 2020

(will need a fresh /lgtm)

I'll work on rebasing the others now as well.

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                                        Old Coverage   New Coverage   Delta
pkg/reconciler/pipelinerun/metrics.go       86.8%          84.7%          -2.0
pkg/reconciler/pipelinerun/pipelinerun.go   81.5%          81.1%          -0.4
pkg/reconciler/taskrun/controller.go        96.4%          96.6%          0.1
pkg/reconciler/taskrun/metrics.go           86.7%          85.4%          -1.2
pkg/reconciler/taskrun/taskrun.go           76.1%          75.5%          -0.6

@afrittoli (Member)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2020
@tekton-robot tekton-robot merged commit 5aafdf9 into tektoncd:master Jun 8, 2020
@mattmoor mattmoor deleted the periodic-metrics branch June 8, 2020 15:07
Labels: approved, kind/bug, lgtm, size/M
Successfully merging this pull request may close these issues: RunningPipelineRun metric is racy
5 participants